Let's assume we have a shop that sells bananas.
We buy bananas from one dealer who claims that he grows them organically in his own garden.
When we receive our bananas, though, they do not all look the same: some are small, some thick, some long... we observe variation.
There are two scenarios: 1) it is indeed a single variety of banana and the variation is natural; 2) there is more than one variety, and our dealer probably buys them from third-party dealers.
In [7]:
%matplotlib inline
import matplotlib.pyplot as plt
In [51]:
# Let's assume our dealer lies and the bananas really do come from elsewhere
from sklearn.datasets import make_blobs
# Each center is [length, thickness] for one variety
bananas_dimentions = [[10, 3],   # long - thin
                      [5, 2],    # short - thin
                      [7.5, 5]]  # medium - thick
bananas_dimentions_std = 1.0
n_bananas = 1000
X, banana_labels = make_blobs(n_samples=n_bananas,
                              centers=bananas_dimentions,
                              cluster_std=bananas_dimentions_std)
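To make the shape of the generated data concrete, here is a minimal sketch of what `make_blobs` hands back for the three centers above. The names `X_demo` and `labels_demo` and the fixed `random_state=0` are assumptions added for illustration; the cell above draws a fresh random sample on every run.

```python
import numpy as np
from sklearn.datasets import make_blobs

# Same centers as above: [length, thickness] per variety
centers = [[10, 3], [5, 2], [7.5, 5]]

# random_state=0 is an assumption added here so the draw is repeatable
X_demo, labels_demo = make_blobs(n_samples=1000,
                                 centers=centers,
                                 cluster_std=1.0,
                                 random_state=0)

# One row per banana, one column per measured dimension
print(X_demo.shape)

# labels_demo holds the index of the center each banana was drawn from;
# this is the "hidden truth" that the clustering never gets to see
print(np.bincount(labels_demo))
```

The labels are only kept so that we could check a clustering afterwards; the rest of the notebook works from the coordinates alone.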
In [52]:
# Now let's pretend all we have are the n_bananas above, all in the same basket.
# We know nothing about their origin.
# Let's plot what we see in terms of length (x) and thickness (y)
plt.scatter(X[:, 0], X[:, 1])
plt.xlabel("length")
plt.ylabel("thickness")
Out[52]:
In [53]:
# At first glance something looks suspicious.
# Let's perform some clustering to see what we can measure.
In [57]:
# Since we do not know whether our bananas come from different places,
# we can try different numbers of clusters and evaluate some metric
from sklearn.cluster import KMeans

def perform_k_means(X, n_clusters):
    # Fit k-means with the given number of clusters and
    # return the cluster index assigned to each banana
    y_pred = KMeans(n_clusters=n_clusters).fit_predict(X)
    return y_pred
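Besides the per-point labels, a fitted `KMeans` object also exposes the cluster centers it found and the within-cluster sum of squares (`inertia_`). The sketch below is only an illustration on a repeatable sample: `X_demo`, `true_centers`, `found` and the fixed `random_state` values are assumptions, not part of the notebook above.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Generating centers, listed in order of increasing length
true_centers = np.array([[5, 2], [7.5, 5], [10, 3]], dtype=float)

# Repeatable sample (random_state is an assumption for this sketch)
X_demo, _ = make_blobs(n_samples=1000,
                       centers=true_centers,
                       cluster_std=1.0,
                       random_state=0)

# n_init=10 restarts k-means from 10 random initialisations and keeps the best run
model = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X_demo)

# Cluster indices are arbitrary, so sort the centers by length before comparing
found = model.cluster_centers_[np.argsort(model.cluster_centers_[:, 0])]

print(found.round(2))   # estimated [length, thickness] per cluster
print(model.inertia_)   # sum of squared distances to the nearest center
```

If the three varieties are real, the estimated centers should land close to the generating ones, although overlapping clusters pull them apart slightly.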
In [63]:
# Let's see some plots, one panel per number of clusters.
# k=1 is the dummy case: no clustering at all.
for i, k in enumerate([1, 2, 3, 4]):
    y_pred = perform_k_means(X, k)
    plt.subplot(2, 2, i + 1)
    plt.scatter(X[:, 0], X[:, 1], c=y_pred)
    plt.title("k = %d" % k)
plt.tight_layout()
Out[63]:
In [64]:
# At this point we need an evaluation phase.
# We need a metric that hints at which of the clusterings above
# fits the data best.
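One candidate metric, offered here as an assumption rather than as the choice this notebook will eventually make, is the silhouette coefficient from `sklearn.metrics`. It ranges from -1 to 1, higher values indicate tighter and better separated clusters, and it is only defined for two or more clusters, so the k=1 case is left out. The names `X_demo`, `scores` and `best_k` are hypothetical.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Repeatable stand-in for the basket of bananas (random_state is an assumption)
X_demo, _ = make_blobs(n_samples=1000,
                       centers=[[10, 3], [5, 2], [7.5, 5]],
                       cluster_std=1.0,
                       random_state=0)

# Mean silhouette coefficient for each candidate number of clusters
scores = {}
for k in range(2, 6):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X_demo)
    scores[k] = silhouette_score(X_demo, labels)

for k, score in scores.items():
    print(k, round(score, 3))

# The k with the highest mean silhouette is the one this metric favours
best_k = max(scores, key=scores.get)
print("favoured number of clusters:", best_k)
```

A single metric is only a hint: with clusters that overlap as much as these do, it is worth comparing the scores with the scatter plots before drawing conclusions about the dealer.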
In [65]:
# ... TODO coming in the next episode :-)